Nice Plotly Features

I made notes outlining the useful features of plotly here.

From that list I think some worth exploring include:

  1. Subplots
  2. Custom hover info
  3. Custom controls (buttons, dropdowns, sliders, etc.)

I’m going to combine these to create a cool visualization (hopefully without too much effort).

The Topic

I want to use something non-controversial that people generally just tend to find interesting. I was initially thinking savings vs expenditures by state or country as a picture of household financial health but access to the data is not as straightforward as I’d expected and economics are so overdone. So, I’ll go with baby names in the US.

The Plan

I’ll create 2 plots as follows:

  1. Choropleth map applying to most recent year of data
    • Color = overall US ranking of #1 name in state
      • Question answered: Where does each state’s most popular name fall among all US names? Should allow us to determine which states prefer less popular names.
      • Hypothesis: The most popular name in the US will most likely also be the most popular in states with the largest number of births (I hadn’t planned to calculate this but I might).
    • Custom hover info = Most popular baby name in state, overall US ranking, and number of states where this name is most popular
    • Button = (optional) toggle boy/girl names
  2. Line graph of top 10 baby names over time
    • y-axis = ranking
    • x-axis = year
    • Range slider = change time (to zoom in on period of interest)
    • Custom hover info = name, rank (x-axis), year (y-axis)
      • other info? … maybe all-time popularity (over all years), number of years at #1
    • NOTE: I’m going to have to keep an eye on the legend to make sure it doesn’t get too large… and fitting this with the chloropleth might be a bit tricky. Some adjustments will need to be made probably.

Get the Data

Thank you US Social Security Administration! The data comes zipped as .txt files by year (for National data) or by state, with 1 file per year or state respectively. I’m going to have to combine this data into 2 dataframes (instead of a ton of files). The data in files are separated by commas.

download.file("https://www.ssa.gov/OACT/babynames/names.zip",
              destfile = "./plotly_data/names.zip") # national data
download.file("https://www.ssa.gov/OACT/babynames/state/namesbystate.zip",
              destfile = "./plotly_data/namesbystate.zip") # state data

# time of download
Sys.time() %>%
    format(., format = "%e %b %Y at %H:%M:%S", usetz = TRUE) %>%
    saveRDS(., file = "./plotly_data/dl_timestamp.Rdata")
# unzip files to tempdir & list
td <- tempdir()
unzip("./plotly_data/names.zip", exdir = td)
unzip("./plotly_data/namesbystate.zip", exdir = td)
US_txt <- list.files(td, "^yob.*\\.txt$", full.names = TRUE)
state_txt <- list.files(td, "^[A-Z]{2}\\.TXT$", full.names = TRUE)

# combine into 2 dataframes
US_baby <- map(US_txt,
               ~ read_csv(.x,
                         col_names = c("Name", "Sex", "n"),
                         col_types = "cci"
                         ) %>%
                   mutate(., Year = as.integer(str_extract(.x, "[0-9]{4}"))) # add year
               ) %>%
           reduce(rbind)

state_baby <- map(state_txt,
               ~ read_csv(.x,
                         col_names = c("State", "Sex", "Year", "Name", "n"),
                         col_types = "ccici"
                         )
               ) %>%
           reduce(rbind)

Create the plots

Calculations

US_no1 <- US_baby %>%
    group_by(., Sex, Year) %>%
    filter(., n == max(n)) %>%
    add_count(., Name)
US_no1
yrs_no1 <- ungroup(US_no1) %>%
    count(., Name, Sex, sort = TRUE)
yrs_no1
US_2017 <- filter(US_baby, Year == 2017) %>%
    arrange(., desc(n)) %>%
    mutate(., US_rank = row_number())

US_2017
state_last <- filter(state_baby, Year == max(Year)) %>%
    group_by(., State, Sex) %>%
    filter(., n == max(n)) %>%
    add_count(., sort = TRUE) %>%
    left_join(., select(US_2017, Name, Sex, US_rank), by = c("Name", "Sex"))

state_no1 <- ungroup(state_last) %>%
    count(., Name, Sex) %>%
    rename(., num_states = nnn)
state_no1
state_comb <- left_join(state_last, state_no1, by = c("Name", "Sex")) %>%
    mutate(., # combine info for ties
       Name2 = if_else(
           lead(paste(State, Sex, Year, n), 1) == paste(State, Sex, Year, n),
           paste(Name, lead(Name, 1), sep = " / "),
           NA_character_,
           missing = Name
           ),
       US_rank2 = if_else(
           lead(paste(State, Sex, Year, n), 1) == paste(State, Sex, Year, n),
           paste(US_rank, lead(US_rank, 1), sep = " / "),
           NA_character_,
           missing = as.character(US_rank)
           ),
       num_states2 = if_else(
           paste(State, Sex, Year, n) == lead(paste(State, Sex, Year, n), 1),
           paste(num_states, lead(num_states, 1), sep = " / "),
           NA_character_,
           missing = as.character(num_states)
           )
     )

# remove second line for ties, add hover text
state_final <- filter(state_comb, nn == 1 | (nn == 2 & grepl("/", Name2))) %>%
    mutate(.,
           hover = paste(
               if_else(nn > 1, "TIE - ", ""), Name2,
               "<br>Total births = ", n,
               "<br>US rank = ", US_rank2,
               "<br>No. of states where #1 = ", num_states2,
               sep = ""
               )
    )
state_final <- ungroup(state_final) %>%
    mutate(., Sex = recode(Sex, "M" = "Boy", "F" = "Girl")) %>%
    left_join(., as_tibble(cbind(state.abb, state.name)), by = c("State" = "state.abb"))

state_final

Create the choropleth

Got the transform and button info from: https://community.plot.ly/t/need-help-on-using-dropdown-to-filter/6596/2

Color info from: https://moderndata.plot.ly/create-colorful-graphs-in-r-with-rcolorbrewer-and-plotly/

This choropleth works and the button changes the data like I want but I can’t figure out how to get the data change AND the color change.

# specify some map projection/options
ch_opts <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = FALSE
)

p <- plot_geo(state_final, locationmode = 'USA-states') %>%
    add_trace(., type = "choropleth",
        z = ~US_rank, text = ~hover, locations = ~State,
        color = ~US_rank, colors = "PuRd", 
        transforms = list(
            list(
                type = "filter",
                target = ~Sex,
                operation = "=",
                value = c("Boy", "Girl"),
                z = ~US_rank
            )
        )
        ) %>%
    colorbar(title = "U.S. Rank") %>%
    layout(
        title = '2017 Most Popular Baby Names in US by State<br>(Hover for breakdown)',
        geo = ch_opts,
        updatemenus = list(
            list(
                type = 'buttons',
                active = 0,
                buttons = list(
                    list(method = "restyle",
                         args = list("transforms[0].value", "Boy"),
                         label = "Boy"),
                    list(method = "restyle",
                         args = list("transforms[0].value", "Girl"),
                         label = "Girl")
                )
            )
        )
    )


p

Ask SO

The documentation on plotly for R is just too poor. There are a handful of examples on how to enter something but none of the explanation about why a certain approach is used (and multiple instances in the documentation that say “all the options follow below” without any options.

Time to ask the Stack Overflow community. Creating a minimal, reproducible example.

# prep data for SO
df <- select(state_final, code = State, sex = Sex, us_rank = US_rank) %>%
    mutate(., sex = recode(sex, "Boy" = "M", "Girl" = "F"))

dput(df)

# put everything below this on SO
    df <- structure(list(code = c("AK", "WI", "WY", "AK", "AL", "AL", "AR", 
    "AR", "AZ", "AZ", "CA", "CA", "CO", "CO", "CT", "CT", "DC", "DC", 
    "DE", "DE", "FL", "FL", "GA", "GA", "HI", "HI", "IA", "IA", "ID", 
    "ID", "IL", "IL", "IN", "IN", "KS", "KS", "KY", "KY", "LA", "LA", 
    "MA", "MA", "MD", "MD", "ME", "ME", "MI", "MI", "MN", "MN", "MO", 
    "MO", "MS", "MS", "MT", "MT", "NC", "NC", "ND", "ND", "NE", "NE", 
    "NH", "NH", "NJ", "NJ", "NM", "NM", "NV", "NV", "NY", "NY", "OH", 
    "OH", "OK", "OK", "OR", "OR", "PA", "PA", "RI", "RI", "SC", "SC", 
    "SD", "SD", "TN", "TN", "TX", "TX", "UT", "UT", "VA", "VA", "VT", 
    "VT", "WA", "WA", "WI", "WV", "WV", "WY"), sex = c("F", "F", 
    "M", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", 
    "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", 
    "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", 
    "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", 
    "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", 
    "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", 
    "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", 
    "M", "F", "M", "F", "M", "M", "F", "M", "F"), us_rank = c(1L, 
    1L, 2L, 9L, 5L, 7L, 1L, 14L, 1L, 2L, 1L, 4L, 1L, 2L, 3L, 4L, 
    5L, 9L, 3L, 10L, 6L, 2L, 5L, 7L, 1L, 2L, 28L, 15L, 1L, 15L, 3L, 
    4L, 1L, 15L, 1L, 15L, 1L, 7L, 3L, 2L, 1L, 11L, 5L, 2L, 18L, 15L, 
    1L, 2L, 3L, 15L, 3L, 7L, 5L, 7L, 3L, 9L, 5L, 7L, 1L, 15L, 1L, 
    15L, 18L, 10L, 1L, 2L, 13L, 4L, 1L, 2L, 3L, 2L, 1L, 2L, 1L, 7L, 
    1L, 15L, 1L, 2L, 18L, 17L, 5L, 7L, 1L, 15L, 5L, 7L, 1L, 4L, 3L, 
    15L, 3L, 2L, 26L, 37L, 3L, 2L, 29L, 1L, 2L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -102L))

    library(magrittr)
    library(plotly)
    
    geo_opts <- list(
      scope = 'usa',
      projection = list(type = 'albers usa'),
      showlakes = FALSE
    )

    plot_geo(df, locationmode = 'USA-states') %>%
        add_trace(., type = "choropleth",
            z = ~us_rank, locations = ~code, color = ~us_rank, 
            transforms = list(
                list(
                    type = "filter",
                    target = ~sex,
                    operation = "=",
                    value = c("M", "F")
                )
            )
            ) %>%
        layout(
            geo = geo_opts,
            updatemenus = list(
                list(
                    type = 'buttons',
                    active = 0,
                    buttons = list(
                        list(method = "restyle",
                             args = list("transforms[0].value", "M"),
                             label = "Boy"),
                        list(method = "restyle",
                             args = list("transforms[0].value", "F"),
                             label = "Girl")
                    )
                )
            )
        )

Choropleths side-by-side

Notes about Layout():

  • Needs to be called with each choropleth to provide geo options
  • Title can only be specified for entire plot (if multiple, value will be last title = call in layout()
  • How to control subplot margins = unknown; call in individual plot layout has no effect
  • Cannot move 2nd colorbar/legend (‘boys’ map)
  • Annotations seem to work better when called in layout() after calling subplot()
boy <- filter(state_final, Sex == "Boy")
girl <- filter(state_final, Sex == "Girl")

# specify some map projection/options
ch_opts <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = FALSE
)

boy_ch <- plot_geo(boy, locationmode = 'USA-states') %>%
    add_trace(., type = "choropleth",
        z = ~US_rank, text = ~hover, locations = ~State,
        zauto = FALSE, zmin = 1, zmax = 30,
        color = ~US_rank, colors = "GnBu", reversescale = TRUE
        ) %>%
    colorbar(title = "U.S. Rank", y = -1) %>%
    layout(., geo = ch_opts)


girl_ch <- plot_geo(girl, locationmode = 'USA-states') %>%
    add_trace(., type = "choropleth",
        z = ~US_rank, text = ~hover, locations = ~State,
        zauto = FALSE, zmin = 1, zmax = 30,
        color = ~US_rank, colors = "PuRd", reversescale = TRUE
        ) %>%
    colorbar(., title = "U.S. Rank", x = 0.45) %>%
    layout(., geo = ch_opts)

subplot(girl_ch, boy_ch) %>%
    layout(annotations = list(
        list(
            text = "2017 Girl Names",
            xref = "paper", xanchor = "center", x = 0.25,
            yref = "paper", yanchor = "bottom", y = 0.1,
            align = "center",
            showarrow = FALSE
            ),
        list(
            text = "2017 Boy Names",
            xref = "paper", xanchor = "center", x = 0.75,
            yref = "paper", yanchor = "bottom", y = 0.1,
            align = "center",
            showarrow = FALSE
            )
        )
    )

I’m going to leave this where it is for now. It’s definitely not what I was hoping for but the documentation is rough.

Top 5 names by year

top5 <- group_by(US_baby, Sex, Year) %>%
    top_n(., n = 5, wt = n) %>%
    arrange(., desc(n)) %>%
    mutate(., rank = row_number())

top5_50y <- ungroup(top5) %>%
    filter(., Year >= (max(Year) - 50)) %>%
    mutate(., Sex = recode(Sex, "M" = "Boy", "F" = "Girl"))
arrange(top5_50y, Year, Sex)
unique(top5_50y$Year) %>% sort()
##  [1] 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## [15] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## [29] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## [43] 2009 2010 2011 2012 2013 2014 2015 2016 2017
girl5 <- filter(top5_50y, Sex == "Girl")
boy5 <- filter(top5_50y, Sex == "Boy")

girl_n <- n_distinct(girl5$Name)
boy_n <- n_distinct(boy5$Name)

col_palette <- colorRampPalette(brewer.pal(9, "Set1"))

n_distinct(top5$Name) # count total unique names (need a color palette with enough colors)
## [1] 72
girl_g <- ggplot(data = girl5, aes(x = Year, y = rank)) +
    geom_line(aes(color = Name)) +
    scale_y_reverse()

boy_g <- ggplot(data = boy5, aes(x = Year, y = rank)) +
    geom_line(aes(color = Name)) +
    scale_y_reverse()

col_palette <- colorRampPalette(brewer.pal(9, "Set1"))

subplot(girl_g, boy_g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
# boy/girl split by facet (too many names)
ggplot(data = top5, aes(x = Year, y = rank)) +
    geom_point(aes(color = Name)) +
    facet_grid(Sex ~ .)

plot_ly(data = top5, x = ~Year, y = ~rank,
        color = ~Name,
        type = "scatter", mode = "lines")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
# doesn't work with transforms...
us_p <- plot_ly(data = top5, x = ~Year, y = ~rank,
        color = ~Name, colors = col_palette(n_distinct(top5_50y$Name)),
        type = "scatter", mode = "lines",
        transforms = list(
            list(
                type = 'filter',
                target = 'Sex',
                operation = '=',
                value = c("Boy", "Girl")
                )
            )
  )

us_p %>%
    layout(.,
           updatemenus = list(
               list(
                type = 'buttons',
                active = 0,
                buttons = list(
                    list(method = "restyle",
                         args = list("transforms[0].value", "Boy"),
                         label = "Boy"),
                    list(method = "restyle",
                         args = list("transforms[0].value", "Girl"),
                         label = "Girl")
                    )
                )
           )
    )

Make lines appear by using simpler case (works!)

col_palette <- colorRampPalette(brewer.pal(9, "PuRd")[5:9])

# testing out package `hues` (clustering to get n distinct colors)
library(hues)

col_palette(girl_n)
##  [1] "#DF65B0" "#E05BA9" "#E151A3" "#E2489D" "#E43E97" "#E53491" "#E62B8B"
##  [8] "#E42683" "#E0227B" "#DC1E73" "#D81B6A" "#D41762" "#D0135A" "#C91054"
## [15] "#C10D51" "#B80A4E" "#AF074B" "#A70548" "#9E0245" "#960041" "#8E003B"
## [22] "#860036" "#7E0030" "#76002A" "#6E0024" "#67001F"
iwanthue(girl_n) %>% attributes()
## $names
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
# girls plot
arrange(girl5, Year) %>%
plot_ly(data = ., type = 'scatter', mode = 'lines+markers',
        x = ~Year, y = ~rank, color = ~Name, colors = unname(iwanthue(girl_n))
) %>%
    layout(.,
           yaxis = list(
               tickmode = 'linear', tick0 = 1, dtick = 1,
               autorange = "reversed",
               title = "Rank"
               ),
           xaxis = list(
               title = "Year"
           )
        )
# boys plot
arrange(boy5, Year) %>%
plot_ly(data = ., type = 'scatter', mode = 'lines+markers', connectgaps = FALSE,
        x = ~Year, y = ~rank, color = ~Name, colors = unname(iwanthue(boy_n))
) %>%
    layout(.,
           yaxis = list(
               tickmode = 'linear', tick0 = 1, dtick = 1,
               autorange = "reversed",
               title = "Rank"
               ),
           xaxis = list(
               title = "Year"
           )
        )
# just discovered that lines connect without gaps --> need NA values to create gaps
#   in plotly

How do I add the missing years with NA values in rank? (This is something I should be able to do)

girl5
filter(girl5, Name == "Abigail")
# years 1967 - 2017
t <- spread(girl5, key = Name, value = rank) %>%
    gather(., key = "Name", value = "rank", -Sex, -n, -Year, na.rm = FALSE)
t
count(t, Name)

This makes n incorrect. A better way to do this might be to go back to the original data, identify the top 5 names each year, filter every name, then either:

  1. create rank for the top 5 only (and leave NA everywhere else)
  2. ranke each name for every year and replace anything less than 5 with NA (or leave the rank)
tmp_50yr <- mutate(US_baby, Sex = recode(Sex, "F" = "Girl", "M" = "Boy")) %>%
    filter(., Year >= max(Year) - 50) %>%
    group_by(., Sex, Year) %>%
    arrange(., desc(n)) %>%
    mutate(., Rank = row_number())

top_girls <- filter(tmp_50yr, Rank <= 5 & Sex == "Girl") %>%
    {unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too

top_boys <- filter(tmp_50yr, Rank <= 5 & Sex == "Boy") %>%
    {unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too


girl5 <- filter(tmp_50yr, Name %in% top_girls & Sex == "Girl") %>%
    arrange(., Name, Year)
boy5 <- filter(tmp_50yr, Name %in% top_boys & Sex == "Boy") %>%
    arrange(., Year)
girl_n <- n_distinct(girl5$Name)
boy_n <- n_distinct(boy5$Name)

plot_ly(data = girl5, type = 'scatter', mode = 'lines+markers',
        x = ~Year, y = ~Rank, color = ~Name, colors = unname(iwanthue(girl_n))
        ) %>%
    layout(., yaxis = list(autorange = "reversed"))
filter(girl5, Name == "Madison") 

Well, that worked but I had assumed that the top 5 names for any given year would always be in the top 1000 and therefore have a rank for each year. That is not the case for all names (example: Madison).

Well, I could go back to the spread/gather approach and before spreading paste n and Rank together so they are spread correctly and then split them after gathering. I could afterward replace NA values in Rank with >1000 but I don’t think I’ll do that. I think I’ll just leave it NA.

glimpse(girl5)
## Observations: 1,314
## Variables: 5
## $ Name <chr> "Abigail", "Abigail", "Abigail", "Abigail", "Abigail", "A...
## $ Sex  <chr> "Girl", "Girl", "Girl", "Girl", "Girl", "Girl", "Girl", "...
## $ n    <int> 245, 263, 280, 362, 392, 394, 479, 605, 614, 829, 796, 10...
## $ Year <int> 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 197...
## $ Rank <int> 605, 592, 591, 526, 491, 468, 413, 357, 354, 278, 297, 24...
girl5_2 <- unite(girl5, col = n_Rank, n, Rank, sep = "_") %>%
    spread(., key = Year, value = n_Rank) %>%
    gather(., key = "Year", value = "n_Rank", -Name, -Sex, na.rm = FALSE) %>%
    separate(., col = n_Rank, into = c("n", "Rank"), sep = "_", remove = TRUE, convert = TRUE)

plot_ly(data = girl5_2, type = 'scatter', mode = 'lines+markers',
        x = ~Year, y = ~Rank, color = ~Name, colors = unname(iwanthue(girl_n))
        ) %>%
    layout(., yaxis = list(autorange = "reversed"))

Just too busy, will scale down to the top 2 names each year

top2_50yr <- mutate(US_baby, Sex = recode(Sex, "F" = "Girl", "M" = "Boy")) %>%
    filter(., Year >= max(Year) - 50) %>%
    group_by(., Sex, Year) %>%
    arrange(., desc(n)) %>%
    mutate(., Rank = min_rank(-n))

top2_girls <- filter(top2_50yr, Rank <= 2 & Sex == "Girl") %>%
    {unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too

girl2 <- filter(top2_50yr, Name %in% top2_girls & Sex == "Girl") %>%
    arrange(., Name, Year) %>%
    unite(., col = n_Rank, n, Rank, sep = "_") %>%
    spread(., key = Year, value = n_Rank) %>%
    gather(., key = "Year", value = "n_Rank", -Name, -Sex, na.rm = FALSE) %>%
    separate(., col = n_Rank, into = c("n", "Rank"), sep = "_",
             remove = TRUE, convert = TRUE)

top2_boys <- filter(top2_50yr, Rank <= 2 & Sex == "Boy") %>%
    {unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too

boy2 <- filter(top2_50yr, Name %in% top2_boys & Sex == "Boy") %>%
    arrange(., Year) %>%
    unite(., col = n_Rank, n, Rank, sep = "_") %>%
    spread(., key = Year, value = n_Rank) %>%
    gather(., key = "Year", value = "n_Rank", -Name, -Sex, na.rm = FALSE) %>%
    separate(., col = n_Rank, into = c("n", "Rank"), sep = "_",
             remove = TRUE, convert = TRUE)
plot_ly(data = girl2, type = 'scatter', mode = 'lines+markers',
        x = ~Year, y = ~Rank, color = ~Name, colors = unname(iwanthue(girl_n))
        ) %>%
    layout(., yaxis = list(range = c(5.5, 0.5)))

These should match if the calculations were right… some values are being dropped. WHY?

filter(US_baby, Name == "Madison", Year == 1977)
filter(girl2, Name == "Madison", Year == 1977)

I thought it was because input was tmp_50yr instead of top2_50yr but that did not fix the problem. Examine each dataframe to see when data is dropped

filter(US_baby, Name == "Madison", Year == 1977) # this is taking into account boys names
filter(top2_50yr, Name == "Madison", Year == 1977) # still here
filter(girl2, Name == "Madison", Year == 1977) # missing

So, it’s lost in the creation of girl2. Where? Is top2_girls correct?

filter(top2_50yr, Sex == "Girl", Rank <= 2) %>%
    arrange(., Year, Rank) %>%
    .$Name %in% top2_girls %>%
    all()
## [1] TRUE

Appears to be. Now for girl2.

# is each name-sex combination present for 1967-2017 in top2_50yr?
all_yrs <- unique(top2_50yr$Year) %>%
        sort(.)
girl2 <- filter(top2_50yr, Name %in% top2_girls & Sex == "Girl")
group_by(girl2, Name) %>%
    summarize(., sum(Year %in% all_yrs, na.rm = TRUE))

Apparently only ‘Madison’ is missing. Why? Is it present in the original data?

filter(US_baby, Name == "Madison" & Sex == "F") %>%
    arrange(., Year)

Nope. Okay, so there are NO ERRORS… that’s just the way the data is. I guess ‘Madison’ was a very unpopular name for girls before 1984 and didn’t really even exist before 1970. Just out of curiosity… what about boys named ‘Madison’? and compared to girls?

madison <- filter(US_baby, Name == 'Madison') %>%
        arrange(., Year)

# boys name popularity
filter(madison, Sex == 'M') %>%
    ggplot(., aes(x = Year, y = n)) +
        geom_line()

# boys vs girls popularity
ggplot(madison, aes(x = Year, y = n, color = Sex)) +
    geom_line()

Similar trend but the name never really caught on for boys… probably because people overwhelmingly preferred it for girls.